- R is an implementation of S language, created in 1976.
- A modified version of S (S-PLUS) was developed in 1993.
- Developed in 1991 by Ross Ihaka and Robert Gentleman and became free in 1995
1/5/2021
R can be downloaded here.
One of the most used ways to use R is through the free and open source (but there is also commercial version) Rstudio Desktop, which helps to control all the different entities in R.
Rstudio can be downloaded here.
75+4
## [1] 79
41-9
## [1] 32
99/12
## [1] 8.25
17*6
## [1] 102
75+4
## [1] 79
41-9
## [1] 32
99/12
## [1] 8.25
17*6
## [1] 102
9^3
## [1] 729
27%%3
## [1] 0
44%%7
## [1] 2
10e4==100000 24!=34 15>20 30<21
10e4==100000
## [1] TRUE
24!=34
## [1] TRUE
15>20
## [1] FALSE
30<21
## [1] FALSE
R is also able to manage character variables which must be surronded by “” or ’ ’
"Hello world!"
## [1] "Hello world!"
"I am a string"
## [1] "I am a string"
c(10, 30, 50,12)
## [1] 10 30 50 12
Note the c() function which means ‘concatenate’
c("I", "am", "a","string vector")
## [1] "I" "am" "a" "string vector"
Variables can be assigned to object using the “<-” or “=” symbol.
marks <- c(26, 30, 28,19)
basket <- c("peach", "orange", "cherries", "pineapple")
The content of any object can be shown just typing its name.
marks
## [1] 26 30 28 19
basket
## [1] "peach" "orange" "cherries" "pineapple"
We can check the class of any object with the function class(objectname)
class(marks)
## [1] "numeric"
class(basket)
## [1] "character"
Factors are variables in R which take on a limited number of different values; They are therefore Categorical, referring to categories that are called levels in R syntax.
myBiologicalType <- c("coding","UTR","intron","coding",
"UTR","regulatory","UTR")
myBiologicalType <- as.factor(myBiologicalType)
levels(myBiologicalType)
## [1] "coding" "intron" "regulatory" "UTR"
Although it’s not initially obvious, factors are very important. Consider, for example a vector of 100,000 entries summarizing the sex of a given species.
library(pryr)
sexes <- sample(c("male","female"),size = 100000,replace = T)
length(sexes)
## [1] 100000
object_size(sexes)
## 800,160 B
object_size(as.factor(sexes))
## 400,560 B
In R, objects can be modified in many ways:
let’s create two vectors:
vec1 <- 1:10 # What am I doing here? vec2 <- seq(from=100,to = 1000,by = 100) # And here?
The length of the vector can be retrieved through the length() function:
length (vec1)
## [1] 10
length(vec2)
## [1] 10
we can subset vectors indicating the index in square bracket
vec2 #show the full vector
## [1] 100 200 300 400 500 600 700 800 900 1000
vec2[3] # show the third element of the vector
## [1] 300
vec2[8] # show the eighth element of a vector
## [1] 800
we can subset vectors indicating the index in square bracket
vec2 #show the full vector
## [1] 100 200 300 400 500 600 700 800 900 1000
vec2[3:6] # show FROM the third element TO the sixth element of the vector
## [1] 300 400 500 600
vec2[c(2,5,9)] # show the 2nd,5th and 9th element
## [1] 200 500 900
We can replace values in vectors using this syntax:
vec2 #show the full vector
## [1] 100 200 300 400 500 600 700 800 900 1000
vec2[3] # show the third element
## [1] 300
vec2[3] <- 450 # replace the 3rd elemnt with 450 vec2[3]
## [1] 450
When working with vectors it is possible to apply an operation to all the variables at once:
vec2
## [1] 100 200 450 400 500 600 700 800 900 1000
vec2/2 # replace the 3rd elemnt with 450
## [1] 50 100 225 200 250 300 350 400 450 500
vec2-30
## [1] 70 170 420 370 470 570 670 770 870 970
When working with vectors it is possible to apply an operation to all the variables at once:
vec2
## [1] 100 200 450 400 500 600 700 800 900 1000
vec2*2 #show the full vector
## [1] 200 400 900 800 1000 1200 1400 1600 1800 2000
vec2+50 # show the third element
## [1] 150 250 500 450 550 650 750 850 950 1050
We can also perform operations between vectors:
vec2 #show vec2
## [1] 100 200 450 400 500 600 700 800 900 1000
vec1 # show vec1
## [1] 1 2 3 4 5 6 7 8 9 10
vec1*vec2 # Multiply vec1 and vec2
## [1] 100 400 1350 1600 2500 3600 4900 6400 8100 10000
We can also perform operations between vectors:
vec2 #show vec2
## [1] 100 200 450 400 500 600 700 800 900 1000
vec1 # show vec1
## [1] 1 2 3 4 5 6 7 8 9 10
vec1/vec2 # Divide vec2 and vec1
## [1] 0.010000000 0.010000000 0.006666667 0.010000000 0.010000000 0.010000000 ## [7] 0.010000000 0.010000000 0.010000000 0.010000000
We can also perform operations between vectors:
vec2 #show vec2
## [1] 100 200 450 400 500 600 700 800 900 1000
vec1 # show vec1
## [1] 1 2 3 4 5 6 7 8 9 10
vec1+vec2 # sum vec1 and vec2
## [1] 101 202 453 404 505 606 707 808 909 1010
We can also perform operations between vectors:
vec2 #show vec2
## [1] 100 200 450 400 500 600 700 800 900 1000
vec1 # show vec1
## [1] 1 2 3 4 5 6 7 8 9 10
vec1-vec2 # Subtract vec2 to vec1
## [1] -99 -198 -447 -396 -495 -594 -693 -792 -891 -990
When performing operations between vectors, the elements of the shorter one are recycled.
v1 <- 1:5 v2 <- c(2,3) v1*v2 # What is going on here?
## Warning in v1 * v2: longer object length is not a multiple of shorter object ## length
## [1] 2 6 6 12 10
x <- 1:10 # create a vector from 1 to 10 y <- 51:60 # create a vectro from 51:60 plot(x,y) #plot x and y
x <- 1:10 # create a vector from 1 to 10 y <- 51:60 # create a vector from 51:60 plot(x,y) # plot x and y lines(x,y,col="red") #add a red line
x <- rnorm(1000,mean=10) # create random normal
# distribution of 1,000 samples with mean=10
hist(x)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] ## [1,] 63 6 67 84 43 6 19 99 ## [2,] 67 49 13 9 91 21 75 86 ## [3,] 8 57 3 55 74 6 62 85 ## [4,] 83 21 49 67 88 70 61 34 ## [5,] 79 14 33 97 9 60 40 7
m=matrix(1:100,nrow=10) m
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] ## [1,] 1 11 21 31 41 51 61 71 81 91 ## [2,] 2 12 22 32 42 52 62 72 82 92 ## [3,] 3 13 23 33 43 53 63 73 83 93 ## [4,] 4 14 24 34 44 54 64 74 84 94 ## [5,] 5 15 25 35 45 55 65 75 85 95 ## [6,] 6 16 26 36 46 56 66 76 86 96 ## [7,] 7 17 27 37 47 57 67 77 87 97 ## [8,] 8 18 28 38 48 58 68 78 88 98 ## [9,] 9 19 29 39 49 59 69 79 89 99 ## [10,] 10 20 30 40 50 60 70 80 90 100
m <- matrix(1:100,nrow=10,byrow = T) m
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] ## [1,] 1 2 3 4 5 6 7 8 9 10 ## [2,] 11 12 13 14 15 16 17 18 19 20 ## [3,] 21 22 23 24 25 26 27 28 29 30 ## [4,] 31 32 33 34 35 36 37 38 39 40 ## [5,] 41 42 43 44 45 46 47 48 49 50 ## [6,] 51 52 53 54 55 56 57 58 59 60 ## [7,] 61 62 63 64 65 66 67 68 69 70 ## [8,] 71 72 73 74 75 76 77 78 79 80 ## [9,] 81 82 83 84 85 86 87 88 89 90 ## [10,] 91 92 93 94 95 96 97 98 99 100
class(m) # shows the class of the object m
## [1] "matrix" "array"
is.matrix(m) # shows TRUE if m is a matrix, FALSE otherwise
## [1] TRUE
dim(m) # shows the number of rows and column of object m
## [1] 10 10
nrow(m) # shows the number of rows ob object m
## [1] 10
ncol(m) # shows the number of columns of object m
## [1] 10
m[3,] # extract the third row
## [1] 21 22 23 24 25 26 27 28 29 30
m[,1] # extract the first column
## [1] 1 11 21 31 41 51 61 71 81 91
m[2:3,c(5,6,9)] # extract the values of the 2nd and third row
## [,1] [,2] [,3] ## [1,] 15 16 19 ## [2,] 25 26 29
# in the 5th, 6th and 9th column m[6,7] # extract the value of the 6th row and senth column
## [1] 57
An important R class is dataframe, which is characterised by high flexibility. In fact despite matrices, dataframes can contain multiple types of data
gene <- c("ADGRB2", "TARBP1", "SLC22A24", "SLC15A5",
"LOC117975386", "C17H17orf99", "TJP3", "TRIB3")
expression_in_cases <-c(100, 30, 35, 66,90, 200,12,8)
expression_in_controls <-c(10, 45, 55, 70,110, 80,25,85)
We can create a data.frame using the data.frame() function:
df <- data.frame(gene,expression_in_cases,expression_in_controls)
dataframes allow for data extraction using the same notations we saw before for matrices So, what the code needed to show only the first two rows of the data.frame?
An important R class is dataframe, which is characterised by high flexibility. In fact despite matrices, dataframes can contain multiple types of data
gene <- c("ADGRB2", "TARBP1", "SLC22A24", "SLC15A5", "LOC117975386", "C17H17orf99", "TJP3", "TRIB3")
expression_in_cases <-c(100, 30, 35, 66,90, 200,12,8)
expression_in_controls <-c(10, 45, 55, 70,110, 80,25,85)
We can create a data.frame using the data.frame() function:
df <- data.frame(gene,expression_in_cases,expression_in_controls)
So, what the code needed to show only the first two rows of the data.frame?
df[1:2,]
## gene expression_in_cases expression_in_controls ## 1 ADGRB2 100 10 ## 2 TARBP1 30 45
df[1:2,]
## gene expression_in_cases expression_in_controls ## 1 ADGRB2 100 10 ## 2 TARBP1 30 45
dataframes also allow to select specific columns usig the $ notations
df$gene #show the column "gene"
## [1] "ADGRB2" "TARBP1" "SLC22A24" "SLC15A5" "LOC117975386" ## [6] "C17H17orf99" "TJP3" "TRIB3"
df$expression_in_cases # show the column "expression_in_cases"
## [1] 100 30 35 66 90 200 12 8
A dataframe can be converted in matrix using the function as.matrix()
m <- as.matrix(df) class(m)
## [1] "matrix" "array"
We can convert a matrix to a datafram using as.dataframe()
df <- as.data.frame(m) class(df)
## [1] "data.frame"
We can also explore the structure of the object using str()
str(m) # show the structure of the object m
## chr [1:8, 1:3] "ADGRB2" "TARBP1" "SLC22A24" "SLC15A5" "LOC117975386" ... ## - attr(*, "dimnames")=List of 2 ## ..$ : NULL ## ..$ : chr [1:3] "gene" "expression_in_cases" "expression_in_controls"
str(df) # show the structure of the object df
## 'data.frame': 8 obs. of 3 variables: ## $ gene : chr "ADGRB2" "TARBP1" "SLC22A24" "SLC15A5" ... ## $ expression_in_cases : chr "100" " 30" " 35" " 66" ... ## $ expression_in_controls: chr " 10" " 45" " 55" " 70" ...
Here is a brief overview of some fundamental data structure in R
| object | modes | Allow_multiple_modes |
|---|---|---|
| vector | numeric, character, complex, logical | No |
| matrix | numeric, character, complex, logical | No |
| dataframe | numeric, character, complex, logical | yes |
set.seed(123)set.seed(123) #2. Type ```set.seed(123)``` m=rnorm(20,mean=0,sd=1) # 3. Create a vector **m** of length 20, composed by numbers drawn by a normal distribution with mean=0 and sd=1 n=m[1:5] # Create a vector **n** with the first 5 elements from the vector **m** hist(m) # 5. Plot an histogram of **m**
# 6. Create a matrix **mat** composed #by 10 rows and 10 columns with numbers from 201...300 mat <- matrix (data = 201:300,nrow=10,ncol=10) # 7. Replace the value located # at the second row, third column with 84 mat[2,3] <- 84
# 8. Plot a *scatterplot* of the second column of # mat in x axis and the and fifth column of mat on the y axis plot(mat[,2],mat[,5])
#9. Convert the matrix **mat** into a dataframe **df** df=as.data.frame(mat) #10. Show the column "V2" of the dataframe *df* using the **$** notation df$V2
## [1] 211 212 213 214 215 216 217 218 219 220
We are now ready to work on a real problem.
Let’s assume you got 1000 euros to do a pilot study on the genetic basis of drug addiction.
I want to be quick and efficient, and I need to prioritise my research towards genes that are ACTUALLY relate to drug addiction.
After hours and hours of literature review I found a list of 12 candidate genes, and their expression at 54 different tissues. They are saved on the file “gene_expression_data.txt”
gene_expression <- read.table(file = "./data/gene_expression_data.txt",
header = T,sep = "\t",check.names = F)
gene_expression[1:5,1:5]
## Name Description Adipose - Subcutaneous ## 1 ENSG00000114166.7 KAT2B 23.67930 ## 2 ENSG00000112038.17 OPRM1 0.00000 ## 3 ENSG00000170312.15 CDK1 4.05339 ## 4 ENSG00000138162.18 TACC2 10.08120 ## 5 ENSG00000189319.13 FAM53B 16.55860 ## Adipose - Visceral (Omentum) Adrenal Gland ## 1 20.10720 9.56261 ## 2 0.00000 0.00000 ## 3 2.76667 1.45045 ## 4 8.07309 4.56019 ## 5 13.62330 25.18740
gene_expression$Description
## [1] "KAT2B" "OPRM1" "CDK1" "TACC2" "FAM53B" ## [6] "FAM53B-AS1" "BRSK2" "ANKS1B" "MYH10" "CCDC42" ## [11] "SLC14A2" "SLC14A2-AS1"
colnames(gene_expression)[-c(1:2)]
## [1] "Adipose - Subcutaneous" ## [2] "Adipose - Visceral (Omentum)" ## [3] "Adrenal Gland" ## [4] "Artery - Aorta" ## [5] "Artery - Coronary" ## [6] "Artery - Tibial" ## [7] "Bladder" ## [8] "Brain - Amygdala" ## [9] "Brain - Anterior cingulate cortex (BA24)" ## [10] "Brain - Caudate (basal ganglia)" ## [11] "Brain - Cerebellar Hemisphere" ## [12] "Brain - Cerebellum" ## [13] "Brain - Cortex" ## [14] "Brain - Frontal Cortex (BA9)" ## [15] "Brain - Hippocampus" ## [16] "Brain - Hypothalamus" ## [17] "Brain - Nucleus accumbens (basal ganglia)" ## [18] "Brain - Putamen (basal ganglia)" ## [19] "Brain - Spinal cord (cervical c-1)" ## [20] "Brain - Substantia nigra" ## [21] "Breast - Mammary Tissue" ## [22] "Cells - Cultured fibroblasts" ## [23] "Cells - EBV-transformed lymphocytes" ## [24] "Cervix - Ectocervix" ## [25] "Cervix - Endocervix" ## [26] "Colon - Sigmoid" ## [27] "Colon - Transverse" ## [28] "Esophagus - Gastroesophageal Junction" ## [29] "Esophagus - Mucosa" ## [30] "Esophagus - Muscularis" ## [31] "Fallopian Tube" ## [32] "Heart - Atrial Appendage" ## [33] "Heart - Left Ventricle" ## [34] "Kidney - Cortex" ## [35] "Kidney - Medulla" ## [36] "Liver" ## [37] "Lung" ## [38] "Minor Salivary Gland" ## [39] "Muscle - Skeletal" ## [40] "Nerve - Tibial" ## [41] "Ovary" ## [42] "Pancreas" ## [43] "Pituitary" ## [44] "Prostate" ## [45] "Skin - Not Sun Exposed (Suprapubic)" ## [46] "Skin - Sun Exposed (Lower leg)" ## [47] "Small Intestine - Terminal Ileum" ## [48] "Spleen" ## [49] "Stomach" ## [50] "Testis" ## [51] "Thyroid" ## [52] "Uterus" ## [53] "Vagina" ## [54] "Whole Blood"
One way to decide to which gene put more effort (and money) I will look their expression in different brain tissues using the barplot(). I will focus on the following tissues
Brain - Cerebellum
barplot(gene_expression$`Brain - Cerebellum`,
names.arg = gene_expression$Description,las=2,
col="blue",cex.names = .8,ylab="transcript per million")
Brain - Cortex
barplot(gene_expression$`Brain - Cortex`,
names.arg = gene_expression$Description,las=2,
col="green",cex.names = .8,ylab="transcript per million")
Brain - Hippocampus
barplot(gene_expression$`Brain - Hippocampus`,
names.arg = gene_expression$Description,las=2,
col="red",cex.names = .8,ylab="transcript per million")
Brain - Hypothalamus
barplot(gene_expression$`Brain - Hypothalamus`,
names.arg = gene_expression$Description,las=2,
col="purple",cex.names = .8,ylab="transcript per million")
par(mfrow=c(2,2),mar=c(6,2.5,1.5,1))
barplot(gene_expression$`Brain - Cerebellum`,
names.arg = gene_expression$Description,las=2,
col="blue",main="Cerebellum")
barplot(gene_expression$`Brain - Cortex`,
names.arg = gene_expression$Description,las=2,
col="green",main="Cortex")
barplot(gene_expression$`Brain - Hippocampus`,
names.arg = gene_expression$Description,las=2,
col="brown",main="Hippocampus")
barplot(gene_expression$`Brain - Hypothalamus`,
names.arg = gene_expression$Description,las=2,
col="purple",main="Hypotalamus")